FIELD OF THE INVENTION

This invention relates generally to techniques for analyzing
linked databases. More particularly, it relates to methods for
assigning ranks to nodes in a linked database, such as any
database of documents containing citations, the world wide web
or any other hypermedia database.




BACKGROUND OF THE INVENTION

Due to the developments in computer technology and its increase
in popularity, large numbers of people have recently started to
frequently search huge databases. For example, internet search
engines are frequently used to search the entire world wide web.
Currently, a popular search engine might execute over 30 million
searches per day of the indexable part of the web, which has a
size in excess of 500 Gigabytes. Information retrieval systems
are traditionally judged by their precision and recall. What is
often neglected, however, is the quality of the results produced
by these search engines. Large databases of documents such as
the web contain many low quality documents. As a result,
searches typically return hundreds of irrelevant or unwanted
documents which camouflage the few relevant ones. In order to
improve the selectivity of the results, common techniques allow
the user to constrain the scope of the search to a specified
subset of the database, or to provide additional search terms.
These techniques are most effective in cases where the database
is homogeneous and already classified into subsets, or in cases
where the user is searching for well known and specific
information. In other cases, however, these techniques are
often not effective because each constraint introduced by the
user increases the chances that the desired information will be
inadvertently eliminated from the search results.

Search engines presently use various techniques that attempt to
present more relevant documents. Typically, documents are
ranked according to variations of a standard vector space model.
These variations could include (a) how recently the document was
updated, and/or (b) how close the search terms are to the
beginning of the document. Although this strategy provides
search results that are better than with no ranking at all, the
results still have relatively low quality. Moreover, when
searching the highly competitive web, this measure of relevancy
is vulnerable to "spamming" techniques that authors can use to
artificially inflate their document's relevance in order to draw
attention to it or its advertisements. For this reason search
results often contain commercial appeals that should not be
considered a match to the query. Although search engines are
designed to avoid such ruses, poorly conceived mechanisms can
result in disappointing failures to retrieve desired
information.

Hyperlink Search Engine, developed by IDD Information Services,
(http://rankdex.gari.com/) uses backlink information (i.e.,
information from pages that contain links to the current page)
to assist in identifying relevant web documents. Rather than
using the content of a document to determine relevance, the
technique uses the anchor text of links to the document to
characterize the relevance of a document. The idea of
associating anchor text with the page the text points to was
first implemented in the World Wide Web Worm (Oliver A. McBryan,
GENVL and WWWW: Tools for Taming the Web, First International
Conference on the World Wide Web, CERN, Geneva, May 25-27,
1994). The Hyperlink Search Engine has applied this idea to
assist in determining document relevance in a search. In
particular, search query terms are compared to a collection of
anchor text descriptions that point to the page, rather than to
a keyword index of the page content. A rank is then assigned to
a document based on the degree to which the search terms match
the anchor descriptions in its backlink documents.

The well known idea of citation counting is a simple method for
determining the importance of a document by counting its number
of citations, or backlinks. The citation rank r(A) of a document
A which has n backlink pages is simply

r(A) = n.

In the case of databases whose content is of relatively uniform
quality and importance it is valid to assume that a highly cited
document should be of greater interest than a document with only
one or two citations. Many databases, however, have extreme
variations in the quality and importance of documents. In these
cases, citation ranking is overly simplistic. For example,
citation ranking will give the same rank to a document that is
cited once on an obscure page as to a similar document that is
cited once on a well-known and highly respected page.

Accordingly, it is a primary object of the present invention to
provide methods for ranking documents in a linked database.
It is another object of the invention to provide such a method that
provides an objective ranking based on the relationship between
documents. Another object of the invention is to provide a
technique for ranking documents within a database whose content
has a large variation in quality and importance. Another object
of the present invention is to provide a document ranking method
that is scalable and can be applied to extremely large databases
such as the world wide web. Additional objects and advantages
will become apparent in view of the following description and
associated figures.



SUMMARY OF THE INVENTION

The present invention achieves the above objects by taking
advantage of the linked structure of a database to assign a rank
to each document in the database, where the document rank is a
measure of the importance of a document. Rather than
determining relevance from the intrinsic content of a document,
or from the anchor text of backlinks to the document, the
present method determines importance from the extrinsic
relationships between documents. Intuitively, a document should
be important (regardless of its content) if it is highly cited
by other documents. Not all citations, however, are of equal
significance. A citation from an important document is more
important than a citation from a relatively unimportant
document. Thus, the importance of a page, and hence the rank
assigned to it, should depend not just on the number of
citations it has, but on the importance of the citing documents
as well. This implies a recursive definition of rank: the rank
of a document is a function of the ranks of the documents which
cite it. The ranks of documents may be calculated by an
iterative procedure on a linked database.

Because citations, or links, are ways of directing attention, 
the important documents correspond to those documents to which 
the most attention is directed. Thus, a high rank indicates 
that a document is considered valuable by many people or by 
important people. Most likely, these are the pages to which 
someone performing a search would like to direct his or her 
attention. Looked at another way, the importance of a page is 
directly related to the steady-state probability that a random 
web surfer ends up at the page after following a large number of 
links. Because there is a larger probability that a surfer will 
end up at an important page than at an unimportant page, this 
method of ranking pages assigns higher ranks to the more 
important pages.



In one aspect of the invention, a computer implemented method is
provided for calculating an importance rank for N linked nodes
of a linked database. The method comprises the steps of:

(a) selecting an initial N-dimensional vector $p_0$;
(b) computing an approximation $p_n$ to a steady-state probability
$p_\infty$ in accordance with the equation $p_n = A^n p_0$, where A is an NxN
transition probability matrix having elements A[i][j]
representing a probability of moving from node i to node j; and
(c) determining a rank r[k] for a node k from a k-th component
of $p_n$.


In a preferred embodiment, the matrix A is chosen so that an
importance rank of a node is calculated, in part, from a
weighted sum of importance ranks of backlink nodes of the node,
where each of the backlink nodes is weighted in dependence upon
the total number of links in the backlink node. In addition,
the importance rank of a node is calculated, in part, from a
constant α representing the probability that a surfer will
randomly jump to the node. The importance rank of a node can
also be calculated, in part, from a measure of distances between
the node and backlink nodes of the node. The initial N-
dimensional vector $p_0$ may be selected to represent a uniform
probability distribution, or a non-uniform probability
distribution which gives weight to a predetermined set of nodes.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a diagram of the relationship between three linked 
hypertext documents according to the invention. 

Fig. 2 is a diagram of a three-document web illustrating the 
rank associated with each document in accordance with the 
present invention.

DETAILED DESCRIPTION 

Although the following detailed description contains many
specifics for the purposes of illustration, anyone of ordinary
skill in the art will appreciate that many variations and
alterations to the following details are within the scope of the
invention. Accordingly, the following preferred embodiments of
the invention are set forth without any loss of generality to,
and without imposing limitations upon, the claimed invention.

For support in reducing the present invention to practice, the
inventor acknowledges Sergey Brin, Scott Hassan, Rajeev Motwani,
Alan Steremberg, and Terry Winograd.

A linked database (i.e. any database of documents containing
mutual citations, such as the world wide web or other hypermedia
archive, a dictionary or thesaurus, and a database of academic
articles, patents, or court cases) can be represented as a
directed graph of N nodes, where each node corresponds to a web
page document and where the directed connections between nodes
correspond to links from one document to another. A given node
has a set of forward links that connect it to children nodes,
and a set of backward links that connect it to parent nodes.
FIG. 1 shows a typical relationship between three hypertext
documents A, B, and C. As shown in this particular figure, the
first links in documents B and C are pointers to document A. In
this case we say that B and C are backlinks of A, and that A is
a forward link of B and of C. Documents B and C also have other
forward links to documents that are not shown.
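
As a minimal sketch of this representation, the documents of FIG. 1 can be stored as a mapping from each document to its forward links, and the backlinks recovered by inverting that mapping. The names `forward_links` and `backlinks_of` are hypothetical, and the targets "D" and "E" are placeholders assumed here for the forward links of B and C that the figure does not show.

```python
from collections import defaultdict

# Forward links of each document (node -> children). "D" and "E" are
# hypothetical placeholders for the links of B and C not shown in FIG. 1.
forward_links = {
    "A": [],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": [],
    "E": [],
}

def backlinks_of(forward):
    """Invert a forward-link mapping: node -> sorted list of parent nodes."""
    back = defaultdict(list)
    for parent, children in forward.items():
        for child in children:
            back[child].append(parent)
    # Make sure every node appears, even if nothing links to it.
    return {node: sorted(back[node]) for node in forward}

backlinks = backlinks_of(forward_links)
print(backlinks["A"])  # ['B', 'C']
```

Storing only forward links and deriving backlinks on demand keeps a single source of truth for the graph; a large system would more likely maintain both directions as indexes.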

Although the ranking method of the present invention is
superficially similar to the well known idea of citation
counting, the present method is more subtle and complex than
citation counting and gives far superior results. In a simple
citation ranking, the rank of a document A which has n backlink
pages is simply

r(A) = n.

According to one embodiment of the present method of ranking,
the backlinks from different pages are weighted differently and
the number of links on each page is normalized. More precisely,
the rank of a page A is defined according to the present
invention as

$$ r(A) = \frac{\alpha}{N} + (1-\alpha)\left(\frac{r(B_1)}{|B_1|} + \cdots + \frac{r(B_n)}{|B_n|}\right), $$

where $B_1, \ldots, B_n$ are the backlink pages of A, $r(B_1), \ldots, r(B_n)$
are their ranks, $|B_1|, \ldots, |B_n|$ are their numbers of forward
links, α is a constant in the interval [0, 1], and N is the
total number of pages in the web. This definition is clearly
more complicated and subtle than the simple citation rank. Like
the citation rank, this definition yields a page rank that
increases as the number of backlinks increases. But the present
method considers a citation from a highly ranked backlink as
more important than a citation from a lowly ranked backlink
(provided both citations come from backlink documents that have
an equal number of forward links). In the present invention, it
is possible, therefore, for a document with only one backlink
(from a very highly ranked page) to have a higher rank than
another document with many backlinks (from very low ranked
pages). This is not the case with simple citation ranking.
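
The comparison above can be made concrete with a small sketch that evaluates the right-hand side of the rank definition for given backlink ranks. The function name `page_rank_of` and all numeric values (the constant 0.15, a web of 1000 pages, the backlink ranks) are assumptions chosen only for illustration.

```python
def page_rank_of(backlinks, alpha, n_pages):
    """Evaluate r(A) = alpha/N + (1 - alpha) * sum(r(B_i) / |B_i|).

    `backlinks` is a list of (rank, forward_link_count) pairs, one per
    backlink page B_i of A.
    """
    inherited = sum(rank / out_degree for rank, out_degree in backlinks)
    return alpha / n_pages + (1 - alpha) * inherited

ALPHA, N = 0.15, 1000  # illustrative values, not prescribed by the text

# One backlink from a highly ranked page with 10 forward links ...
one_strong = page_rank_of([(0.05, 10)], ALPHA, N)
# ... versus five backlinks from lowly ranked pages with 10 links each.
many_weak = page_rank_of([(0.0002, 10)] * 5, ALPHA, N)

print(one_strong > many_weak)  # True
```

Simple citation counting would rank the second document higher (five citations against one), whereas the weighted definition ranks the first higher.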

The ranks form a probability distribution over web pages, so
that the sum of ranks over all web pages is unity. The rank of
a page can be interpreted as the probability that a surfer will
be at the page after following a large number of forward links.
The constant α in the formula is interpreted as the probability
that the web surfer will jump randomly to any web page instead
of following a forward link. The page ranks for all the pages
can be calculated using a simple iterative algorithm, and
correspond to the principal eigenvector of the normalized link
matrix of the web, as will be discussed in more detail below.



In order to illustrate the present method of ranking, consider
the simple web of three documents shown in FIG. 2. For
simplicity of illustration, we assume in this example that α = 0.
Document A has a single backlink, from document C, and this link
is the only forward link of document C, so

r(A) = r(C).

Document B has a single backlink, from document A, but this link
is one of two forward links of document A, so

r(B) = r(A)/2.

Document C has two backlinks. One backlink is from document B,
and this is the only forward link of document B. The other
backlink is from document A via the other of the two forward links
from A. Thus

r(C) = r(B) + r(A)/2.

In this simple illustrative case we can see by inspection that
r(A) = 0.4, r(B) = 0.2, and r(C) = 0.4. Although a typical value
for α is ~0.1, if for simplicity we set α = 0.5 (which
corresponds to a 50% chance that a surfer will randomly jump to
one of the three pages rather than following a forward link),
then the mathematical relationships between the ranks become
more complicated. In particular, we then have

r(A) = 1/6 + r(C)/2,

r(B) = 1/6 + r(A)/4, and

r(C) = 1/6 + r(A)/4 + r(B)/2.

The solution in this case is r(A) = 14/39, r(B) = 10/39, and
r(C) = 15/39.
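
The two solutions stated above can be checked by substituting them back into the rank equations. The following sketch does so with exact rational arithmetic; the link structure is the one described in the text for FIG. 2, and the helper name `satisfies_rank_equations` is hypothetical.

```python
from fractions import Fraction as F

# Forward links of the three-document web of FIG. 2, as described in the text:
# A links to B and C, B links to C, and C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def satisfies_rank_equations(ranks, links, alpha):
    """Check r(X) = alpha/N + (1 - alpha) * sum(r(B)/|B|) for every page X."""
    n = len(links)
    for page in links:
        inherited = sum(
            ranks[src] / len(targets)
            for src, targets in links.items()
            if page in targets
        )
        if ranks[page] != alpha / n + (1 - alpha) * inherited:
            return False
    return True

no_jump = {"A": F(2, 5), "B": F(1, 5), "C": F(2, 5)}          # 0.4, 0.2, 0.4
half_jump = {"A": F(14, 39), "B": F(10, 39), "C": F(15, 39)}

print(satisfies_rank_equations(no_jump, links, F(0)))       # True
print(satisfies_rank_equations(half_jump, links, F(1, 2)))  # True
print(sum(half_jump.values()))                              # 1
```

Using `Fraction` avoids floating-point rounding, so the equality tests are exact; both sets of ranks also sum to one, as expected of a probability distribution.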

In practice, there are millions of documents and it is not
possible to find the solution to a million equations by
inspection. Accordingly, in the preferred embodiment a simple
iterative procedure is used. As the initial state we may simply
set all the ranks equal to 1/N. The formulas are then used to
calculate a new set of ranks based on the existing ranks. In
the case of millions of documents, sufficient convergence
typically takes on the order of 100 iterations. It is not
always necessary or even desirable, however, to calculate the
rank of every page with high precision. Even approximate rank
values, using two or more iterations, can provide very valuable,
or even superior, information.
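
A minimal sketch of this iterative procedure is given below, applied to the three-document web of FIG. 2 with α = 0.5. The function name `iterate_ranks` is hypothetical, the iteration count is fixed rather than driven by a convergence test, and every page is assumed to have at least one forward link.

```python
def iterate_ranks(links, alpha, iterations):
    """Repeatedly apply the rank formula, starting from uniform ranks 1/N.

    `links` maps each page to the list of pages it links to. Every page is
    assumed to have at least one forward link (no childless pages).
    """
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share alpha/N.
        new_ranks = {page: alpha / n for page in links}
        for src, targets in links.items():
            # A page passes (1 - alpha) of its rank equally to its children.
            share = (1 - alpha) * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
approx = iterate_ranks(links, alpha=0.5, iterations=50)
exact = {"A": 14 / 39, "B": 10 / 39, "C": 15 / 39}
print(all(abs(approx[p] - exact[p]) < 1e-9 for p in links))  # True
```

Each round touches every link once, so the cost per iteration grows with the number of links rather than with the square of the number of pages.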

The iteration process can be understood as a steady-state
probability distribution calculated from a model of a random
surfer. This model is mathematically equivalent to the
explanation described above, but provides a more direct and
concise characterization of the procedure. The model includes
(a) an initial N-dimensional probability distribution vector $p_0$
where each component $p_0[i]$ gives the initial probability that a
random surfer will start at a node i, and (b) an NxN transition
probability matrix A where each component A[i][j] gives the
probability that the surfer will move from node i to node j.
The probability distribution of the graph after the surfer
follows one link is $p_1 = A p_0$, and after two links the
probability distribution is $p_2 = A p_1 = A^2 p_0$. Assuming this
iteration converges, it will converge to a steady-state
probability

$$ p_\infty = \lim_{n \to \infty} A^n p_0, $$

which is a dominant eigenvector of A. The iteration circulates
the probability through the linked nodes like energy flows
through a circuit and accumulates in important places. Because
pages with no links occur in significant numbers and bleed off
energy, they cause some complication with computing the ranking.
This complication is caused by the fact they can add huge
amounts to the "random jump" factor. This, in turn, causes
loops in the graph to be highly emphasized which is not
generally a desirable property of the model. In order to
address this problem, these childless pages can simply be
removed from the model during the iterative stages, and added
back in after the iteration is complete. After the childless
pages are added back in, however, the same number of iterations
that was required to remove them should be done to make sure
they all receive a value. (Note that in order to ensure
convergence, the norm of $p_n$ must be made equal to 1 after each
iteration.) An alternate method to control the contribution of
the childless nodes is to only estimate the steady state by
iterating a small number of times.
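
The removal of childless pages can be sketched as follows. This is one possible reading of the text, not a definitive implementation: removing a childless page can leave its parents childless, so the removal is repeated, and the number of passes is recorded. The name `prune_childless` and the example web are hypothetical, and every link target is assumed to be a key of the mapping.

```python
def prune_childless(links):
    """Repeatedly remove pages with no forward links.

    Removing a childless page can leave its parents childless, so the
    removal is repeated until no childless page remains. Returns the pruned
    link mapping and the removed pages grouped by removal pass.
    """
    links = {page: list(targets) for page, targets in links.items()}
    passes = []
    while True:
        childless = {page for page, targets in links.items() if not targets}
        if not childless:
            return links, passes
        passes.append(sorted(childless))
        # Drop the childless pages and every link pointing at them.
        links = {
            page: [t for t in targets if t not in childless]
            for page, targets in links.items()
            if page not in childless
        }

# Hypothetical example: D has no forward links, and E links only to D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A", "E"], "D": [], "E": ["D"]}
core, passes = prune_childless(web)
print(core)    # {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
print(passes)  # [['D'], ['E']]
```

Here two passes were needed, which under the procedure described above suggests running two further iterations on the full graph after the removed pages are added back in.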

The rank r[i] of a node i can then be defined as a function of
this steady-state probability distribution. For example, the
rank can be defined simply by $r[i] = p_\infty[i]$. This method of
calculating rank is mathematically equivalent to the iterative
method described first. Those skilled in the art will
appreciate that this same method can be characterized in various
different ways that are mathematically equivalent. Such
characterizations are obviously within the scope of the present
invention. Because the rank of various different documents can
vary by orders of magnitude, it is convenient to define a
logarithmic rank

$$ r[i] = \log \frac{p_\infty[i]}{\min_{k \in [1,N]} p_\infty[k]}, $$

which assigns a rank of 0 to the lowest ranked node and
increases by 1 for each order of magnitude in importance higher
than the lowest ranked node.

In a preferred embodiment, a finite number of iterations are
performed to approximate $p_\infty$. The initial distribution $p_0$ can be
selected to be uniform or non-uniform. A uniform distribution
would set each component of $p_0$ equal to 1/N. A non-uniform
distribution, for example, can divide the initial probability
among a few nodes which are known a priori to have relatively
large importance. This non-uniform distribution decreases the
number of iterations required to obtain a close approximation to
$p_\infty$ and also is one way to reduce the effect of artificially
inflating relevance by adding unrelated terms.
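
The logarithmic rank defined above can be sketched in a few lines. The base-10 logarithm is an assumption made here because the text speaks of orders of magnitude; the function name `logarithmic_ranks` and the sample probabilities are hypothetical.

```python
import math

def logarithmic_ranks(steady_state):
    """Map steady-state probabilities to log10(p[i] / min(p)).

    Assumes every probability is strictly positive, and assumes base 10
    because the text speaks of "orders of magnitude".
    """
    lowest = min(steady_state.values())
    return {node: math.log10(p / lowest) for node, p in steady_state.items()}

# Hypothetical steady-state probabilities that sum to 1.
p_inf = {"hub": 0.9, "mid": 0.09, "low": 0.009, "leaf": 0.001}
log_ranks = logarithmic_ranks(p_inf)
print(log_ranks["leaf"])  # 0.0
```

The lowest ranked node maps to exactly 0, and every factor of ten in steady-state probability adds 1 to the rank.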

In a preferred embodiment, the transition matrix A is given by

$$ A = \frac{\alpha}{N} \mathbf{1} + (1-\alpha) B, $$

where $\mathbf{1}$ is an NxN matrix consisting of all 1s, α is the
probability that a surfer will jump randomly to any one of the N
nodes, and B is a matrix whose elements B[i][j] are given by

$$ B[i][j] = \begin{cases} \dfrac{1}{n_i} & \text{if node } i \text{ points to node } j, \\ 0 & \text{otherwise,} \end{cases} $$

where $n_i$ is the total number of forward links from node i. The
(1-α) factor acts as a damping factor that limits the extent to
which a document's rank can be inherited by children documents.
This models the fact that users typically jump to a different
place in the web after following a few links. The value of α is
typically around 15%. Including this damping is important when
many iterations are used to calculate the rank so that there is
no artificial concentration of rank importance within loops of
the web. Alternatively, one may set α = 0 and only iterate a few
times in the calculation.
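
As a sketch under stated assumptions, the matrix A can be built explicitly for the three-document web of FIG. 2 and checked against the worked example. The names `transition_matrix` and `step` are hypothetical. Because A[i][j] is read here as the probability of moving from node i to node j, one surfer step is computed as the transpose of A applied to the distribution; this choice of convention is an assumption about the notation $p_1 = A p_0$, not something the text states.

```python
from fractions import Fraction as F

def transition_matrix(links, order, alpha):
    """Build A = (alpha/N) * 1 + (1 - alpha) * B as nested lists.

    B[i][j] = 1/n_i if node i points to node j, else 0, where n_i is the
    number of forward links of node i. Assumes every node has a forward link.
    """
    n = len(order)
    matrix = []
    for src in order:
        targets = links[src]
        row = [
            alpha / n + (1 - alpha) * (F(1, len(targets)) if dst in targets else 0)
            for dst in order
        ]
        matrix.append(row)
    return matrix

def step(matrix, p):
    """One surfer step: new_p[j] = sum_i p[i] * A[i][j].

    With A[i][j] read as the probability of moving from i to j, this is the
    product of the transpose of A with the column vector p (an assumption
    about the convention behind p_1 = A p_0).
    """
    n = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]

order = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
A = transition_matrix(links, order, F(1, 2))
steady = [F(14, 39), F(10, 39), F(15, 39)]
print(all(sum(row) == 1 for row in A))  # True: each row is a distribution
print(step(A, steady) == steady)        # True: the worked example is a fixed point
```

A dense NxN matrix is only practical for tiny examples like this one; for a large web the same step would be computed directly from the link lists.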

There are several ways that this method can be adapted or
altered for various purposes. As already mentioned above,
rather than including the random linking probability α equally
among all nodes, it can be divided in various ways among all the
sites by changing the all-1s matrix $\mathbf{1}$ to another matrix. For example,
it could be distributed so that a random jump takes the surfer
to one of a few nodes that have a high importance, and will not
take the surfer to any of the other nodes. This can be very
effective in preventing deceptively tagged documents from
receiving artificially inflated relevance. Alternatively, the
random linking probability could be distributed so that random
jumps do not happen from high importance nodes, and only happen
from other nodes. This distribution would model a surfer who is
more likely to make random jumps from unimportant sites and
follow forward links from important sites. A modification to
avoid drawing unwarranted attention to pages with artificially
inflated relevance is to ignore local links between documents
and only consider links between separate domains. Because the
links from other sites to the document are not directly under
the control of a typical web site designer, it is then difficult
for the designer to artificially inflate the ranking. A simpler
approach is to weight links from pages contained on the same web
server less than links from other servers. Also, in addition to
servers, internet domains and any general measure of the
distance between links could be used to determine such a
weighting.
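
The first variation above, in which random jumps land only on a few nodes of known importance, can be sketched by replacing the uniform jump share with a jump distribution. The function name `rank_with_jump_distribution`, the pages "S1" and "S2", and the choice of A as the single trusted page are all hypothetical; every page is assumed to have at least one forward link.

```python
def rank_with_jump_distribution(links, jump, alpha, iterations=100):
    """Iterate ranks where random jumps land according to `jump`.

    `jump` maps pages to jump probabilities summing to 1; pages missing
    from it are never the target of a random jump. Every page is assumed
    to have at least one forward link.
    """
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        new_ranks = {page: alpha * jump.get(page, 0.0) for page in links}
        for src, targets in links.items():
            share = (1 - alpha) * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

# Hypothetical web: S1 and S2 only link to each other (a self-promoting loop),
# while A, B, C form the three-document web of FIG. 2.
links = {
    "A": ["B", "C"], "B": ["C"], "C": ["A"],
    "S1": ["S2"], "S2": ["S1"],
}
uniform = {page: 1 / len(links) for page in links}
trusted = {"A": 1.0}  # all random jumps go to a page known to be important

print(rank_with_jump_distribution(links, uniform, 0.15)["S1"] > 0.1)   # True
print(rank_with_jump_distribution(links, trusted, 0.15)["S1"] < 1e-6)  # True
```

With uniform jumps the isolated loop keeps a substantial share of the rank; when jumps are confined to the trusted page, nothing feeds the loop and its rank decays toward zero. The same mechanism could serve for personalization by putting a user's home page or bookmarks in `jump`.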

Additional modifications can further improve the performance of 
this method. Rank can be increased for documents whose backlinks 
are maintained by different institutions and authors in various 
geographic locations. Or it can be increased if links come from 
unusually important web locations such as the root page of a 
domain. 

Links can also be weighted by their relative importance within a 
document. For example, highly visible links that are near the 
top of a document can be given more weight. Also, links that are
in large fonts or emphasized in other ways can be given more
weight. In this way, the model better approximates human usage 
and authors' intentions. In many cases it is appropriate to 
assign higher value to links coming from pages that have been 
modified recently since such information is less likely to be 
obsolete. 
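
One way to express such link weighting is to replace the equal split 1/n of a page's rank among its forward links with normalized per-link weights. The sketch below weights links to the same host less than links to other hosts; the function name `weighted_shares`, the weight 0.2, and the URLs are hypothetical, and the page is assumed to have at least one forward link. Other factors mentioned above (position in the document, emphasis, recency of the source page) could be multiplied into the same weights.

```python
from urllib.parse import urlsplit

def weighted_shares(source_url, target_urls, same_host_weight=0.2):
    """Split one page's outgoing rank among its links using link weights.

    Links to the same host as the source get `same_host_weight` (an
    illustrative value); links to other hosts get weight 1. The weights are
    normalized so that the shares still sum to 1.
    """
    source_host = urlsplit(source_url).hostname
    weights = [
        same_host_weight if urlsplit(url).hostname == source_host else 1.0
        for url in target_urls
    ]
    total = sum(weights)
    return {url: w / total for url, w in zip(target_urls, weights)}

# Hypothetical URLs, for illustration only.
shares = weighted_shares(
    "http://example.com/index.html",
    ["http://example.com/about.html", "http://example.org/paper.html"],
)
print(shares)
```

Normalizing the weights keeps each page's outgoing shares summing to one, so the ranks remain a probability distribution.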

The present method has the advantage that the convergence is
very fast (a few hours using current processors) and it is much
less expensive than building a full-text index. This speed
allows the ranking to be customized or personalized for specific
users. For example, a user's home page and/or bookmarks can be
given a large initial importance, and/or a high probability of a
random jump returning to it. This high rating essentially
indicates to the system that the person's homepage and/or
bookmarks do indeed contain subjects of importance that should
be highly ranked. This procedure essentially trains the system
to recognize pages related to the person's interests.

The present method of determining the rank of a document can
also be used to enhance the display of documents. In
particular, each link in a document can be annotated with an
icon, text, or other indicator of the rank of the document that
each link points to. Anyone viewing the document can then
easily see the relative importance of various links in the
document.

The present method of ranking documents in a database can also
be useful for estimating the amount of attention any document
receives on the web since it models human behavior when surfing
the web. Estimating the importance of each backlink to a page
can be useful for many purposes including site design, business
arrangements with the backlinkers, and marketing. The effect of
potential changes to the hypertext structure can be evaluated by
adding them to the link structure and recomputing the ranking.

Real usage data, when available, can be used as a starting point
for the model and as the distribution for the alpha factor.
This can allow this ranking model to fill holes in the usage
data, and provide a more accurate or comprehensive picture.
Thus, although this method of ranking does not necessarily match
the actual traffic, it nevertheless measures the degree of
exposure a document has throughout the web.

Perhaps the most important application of the present ranking
method is to enhance the quality of results from web search
engines. In this application of the present invention, the
ranking method of the invention is integrated into a web search
engine to produce results far superior to existing methods in
quality and performance. A search engine employing the ranking
method of the present invention offers all the advantages of
automation while producing results comparable to a human
maintained categorized system. In this approach, a web crawler
explores the web and creates an index of the web content, as
well as a directed graph of nodes corresponding to the structure
of hyperlinks. The nodes of the graph (i.e. pages of the web)
are then ranked according to importance by the method
of the present invention.

The search engine is used to locate documents that match the
specified search criteria, either by searching full text, or by
searching titles only. In addition, the search can include the
anchor text associated with backlinks to the page. This idea
has several advantages in this context. First, anchors often
provide more accurate descriptions of web pages than the pages
themselves. Second, anchors may exist for images, programs, and
other objects that cannot be indexed by a text-based search
engine. This also makes it possible to return web pages which
have not actually been crawled. In addition, the engine can
compare the search terms with a list of its backlink document
titles. Thus, even though the text of the document itself may
not match the search terms, if the document is cited by
documents whose titles or backlink anchor text match the search
terms, the document will be considered a match. In addition to
or instead of the anchor text, the text in the immediate
vicinity of the backlink anchor text can also be compared to the
search terms in order to improve the search.

Once a set of documents is identified that match the search
terms, the list of documents is then sorted with high ranking
documents first and low ranking documents last. The ranking in
this case is defined as a function which combines all of the
above factors such as the objective ranking and textual
matching. If desired, the results can be grouped by category or
site as well.
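
The sorting step can be sketched as follows. The text does not prescribe the combining function, so the weighted sum used here is purely an assumption, as are the name `sort_results`, the weight 0.5, and the sample scores; both input scores are assumed to be already scaled to the interval [0, 1].

```python
def sort_results(candidates, text_weight=0.5):
    """Sort matching documents by a combined score, highest first.

    Each candidate is (doc_id, page_rank, text_match) with both scores
    already scaled to [0, 1]. The weighted sum is only one possible
    combining function; the text does not prescribe one.
    """
    def score(candidate):
        _, page_rank, text_match = candidate
        return (1 - text_weight) * page_rank + text_weight * text_match
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates that all matched the query terms.
hits = [
    ("obscure-exact-match", 0.05, 0.90),
    ("well-cited-good-match", 0.80, 0.70),
    ("well-cited-weak-match", 0.90, 0.20),
]
print([doc_id for doc_id, _, _ in sort_results(hits)])
# ['well-cited-good-match', 'well-cited-weak-match', 'obscure-exact-match']
```

Raising `text_weight` toward 1 makes the ordering depend mostly on textual matching, while lowering it toward 0 makes it depend mostly on the objective ranking.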

It will be clear to one skilled in the art that the above 
embodiments may be altered in many ways without departing from 
the scope of the invention. Accordingly, the scope of the 
invention should be determined by the following claims and their 
legal equivalents.